Red Wine Quality Data Analysis

Udacity Data Analyst Nanodegree

P4: Explore and Summarize Data

by D. Satas

October 2016


About the Dataset

This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

  1. Elsevier

  2. bib

Objective of the Analysis

The goal is to predict the quality rating assigned by tasters from the measured physicochemical properties of red wines, as a guide for grape growers and wine producers. Do some of these properties have a significant effect on quality? If so, which ones?

Data Overview

Variable description:

Input Variables:

  1. fixed acidity (tartaric acid - g/dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. volatile acidity (acetic acid - g/dm^3): the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

  3. citric acid (g/dm^3): found in small quantities, citric acid can add “freshness” and flavor to wines

  4. residual sugar (g/dm^3): the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

  5. chlorides (sodium chloride - g/dm^3): the amount of salt in the wine

  6. free sulfur dioxide (mg/dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. total sulfur dioxide (mg/dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. density (g/cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content

  9. pH (scale between 0 and 14): describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. sulphates (potassium sulphate - g/dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. alcohol (% by volume): the percent alcohol content of the wine

Output Variable (based on sensory data):

  1. quality (score between 0 and 10)

Dataset modifications:

The dataframe is replaced by a subset of itself with the following modifications:

  1. Added a column \(quality.f\) with the quality values as a factor type.

  2. Removed the row ID column (the first column), since it adds nothing to the analysis.
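The two modifications above can be sketched in base R. This is only an illustration on a tiny made-up data frame: the ID column name `X` and the sample values are assumptions, and the factor levels 1 to 10 are chosen to match the 10 levels reported in the structure output.

```r
# Hypothetical stand-in for the raw file: an ID column plus two measured columns.
raw <- data.frame(
  X       = 1:4,                      # row ID column (the name is an assumption)
  alcohol = c(9.4, 9.8, 9.8, 10.0),
  quality = c(5L, 5L, 6L, 7L)
)

# Drop the ID column, then add quality as a factor covering the 1-10 scale.
redwine <- raw[, names(raw) != "X"]
redwine$quality.f <- factor(redwine$quality, levels = 1:10)

str(redwine)
```

Fixing the levels explicitly keeps empty categories (such as 1, 2, 9 and 10) visible in later tables and plots.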

Quick view of the dataset statistics

The information about the structure of the dataframe and variable data types.

## 'data.frame':    1599 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.f           : Factor w/ 10 levels "1","2","3","4",..: 5 5 5 6 5 5 5 7 7 5 ...

The dataset consists of 1599 observations of 13 variables. The additional variable \(quality.f\) was created as a factor of the quality scores and will be used to build a model. The variable \(quality\) is of integer type, \(quality.f\) is a factor, and the rest are numeric.


Descriptive statistics of every variable in the dataset.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##                                                                   
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##                                                                   
##     quality        quality.f  
##  Min.   :3.000   5      :681  
##  1st Qu.:5.000   6      :638  
##  Median :6.000   7      :199  
##  Mean   :5.636   4      : 53  
##  3rd Qu.:6.000   8      : 18  
##  Max.   :8.000   3      : 10  
##                  (Other):  0

The summary statistics above include the minimum, quartiles, median, mean, and maximum of each variable. For most variables the mean is greater than the median, which indicates right-skewed distributions, often caused by high outliers. Only \(density\) and \(pH\) have a median almost equal to the mean, which is consistent with a symmetric, roughly normal distribution. The \(quality\) values range from 3 to 8, so the dataset appears to include no wines rated at the very worst or very best levels. The variables \(residual.sugar\), \(chlorides\), \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) have extreme outliers, since their maximum values are far above the 3rd quartile.
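How far "way above the 3rd quartile" is can be made concrete with a small calculation. The sketch below copies the quartiles and maxima from the summary table and compares each maximum with the upper fence of the common 1.5 × IQR rule. That rule is an assumed convention here, not something the analysis itself states.

```r
# Quartiles and maxima copied from the summary table above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides",
               "free.sulfur.dioxide", "total.sulfur.dioxide"),
  q1  = c(1.9, 0.07, 7, 22),
  q3  = c(2.6, 0.09, 21, 62),
  max = c(15.5, 0.611, 72, 289)
)

# Upper fence of the 1.5 * IQR rule (assumed convention for "outlier").
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence

print(stats)
```

For all four variables the maximum lies well beyond the fence, which agrees with the reading of the table given above.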


Note: To save space, measurement units are not shown in the plots, charts or graphs that follow. Please refer to the table below if needed.

##                var_name measure_units
## 1         fixed.acidity        g/dm^3
## 2      volatile.acidity        g/dm^3
## 3           citric.acid        g/dm^3
## 4        residual.sugar        g/dm^3
## 5             chlorides        g/dm^3
## 6   free.sulfur.dioxide       mg/dm^3
## 7  total.sulfur.dioxide       mg/dm^3
## 8               density        g/cm^3
## 9                    pH    scale 0-14
## 10            sulphates        g/dm^3
## 11              alcohol             %
## 12              quality    scale 1-10
## 13            quality.f    scale 1-10

Univariate Plots & Analysis Section

Histograms and a bar plot are used to explore the distribution of each variable. I am not sure whether the explanatory variables are completely independent of each other.

**Figure 1.**

As shown in Figure 1, the \(quality\) variable has most values concentrated in categories 5 and 6. Only a small proportion falls in the remaining categories, and there are no values in categories 1, 2, 9 and 10. \(residual.sugar\), \(free.sulfur.dioxide\), \(total.sulfur.dioxide\) and \(sulphates\) have positively skewed distributions. \(alcohol\) and \(citric.acid\) have irregularly shaped distributions. \(density\) and \(pH\) appear approximately normal.


Boxplots for each of the variables.

**Figure 2.**

The boxplots show the distributions from a different angle. As shown in Figure 2, all variables have outliers. \(free.sulfur.dioxide\) and \(density\) have a few outliers far from most other observations. \(fixed.acidity\), \(volatile.acidity\) and \(citric.acid\) have a lot of outliers. \(alcohol\) and \(citric.acid\) do not have pronounced outliers. \(density\) and \(pH\) are approximately normally distributed. The distributions of \(sulphates\), \(residual.sugar\) and \(chlorides\) are very skewed.

Bivariate Plots & Analysis Section

Plots and Analysis of Explanatory Variables

To get an overview of the relationships between variables, I produced a pairwise comparison of the variables in the dataset. The column \(quality.f\) is dropped because it is a factor. The graph provides two different comparisons of each pair of columns and displays either the density or the count of the respective variable along the diagonal.
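A minimal base-R sketch of the same idea follows: drop the factor column, then compute the pairwise correlations. The data frame `toy` and its simulated values are purely illustrative, and the original figure may well have been drawn with a different plotting function.

```r
set.seed(1)

# Simulated stand-in data; the values are not taken from the wine dataset.
toy <- data.frame(
  fixed.acidity = rnorm(20, mean = 8.3, sd = 1.7),
  pH            = rnorm(20, mean = 3.3, sd = 0.15),
  quality       = sample(3:8, 20, replace = TRUE)
)
toy$quality.f <- factor(toy$quality, levels = 1:10)

# Keep only numeric columns, which drops the factor quality.f.
numeric_only <- toy[, sapply(toy, is.numeric)]

# Pairwise Pearson correlations between the remaining columns.
cor_matrix <- cor(numeric_only)
round(cor_matrix, 3)

# pairs(numeric_only) would draw the matching base-R scatterplot matrix.
```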

**Figure 3.**

The plot in Figure 3 gives a very general idea of the correlations between variables. I picked some of the pairs with the highest correlation coefficients for more detailed analysis.


Scatterplots pair up the more interesting input variables in the dataset. A smoothed conditional mean is added, which helps reveal patterns when points overplot.

**Figure 4.**

Traditionally, total acidity is divided into two groups: the volatile acids and the nonvolatile or fixed acids. One of the predominant fixed acids found in wines is citric acid, so it is no surprise to see a strong correlation between \(fixed.acidity\) and \(citric.acid\) (0.672).

There is a moderate negative correlation between \(volatile.acidity\) and \(citric.acid\) (-0.552). The disadvantage of adding citric acid is its microbial instability. In the European Union, use of citric acid for acidification is prohibited.


The term “sulfites” is an inclusive term for sulfur dioxide (SO2). SO2 is a preservative and widely used in winemaking because of its antioxidant and antibacterial properties. A small amount of sulfites is produced naturally as a byproduct of fermentation, but most of the SO2 has been added by the winemaker.

**Figure 5.**

Total sulfur dioxide is divided into two groups: free sulfur dioxide and bound sulfur dioxide. So, again, it is clear why \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) have a strong correlation (0.668). See Figure 5.


The measure of the amount of acidity in wine is known as the “titratable acidity” or “total acidity”, which refers to the test that yields the total of all acids present, while strength of acidity is measured according to pH, with most wines having a pH between 2.9 and 3.9.

**Figure 6.**

The plot in Figure 6 shows a strong negative correlation between \(fixed.acidity\) and \(pH\) (-0.683).

Plots and Analysis of Explanatory Variable vs. Response Variable

To get an overview of the relationship between the output variable \(quality\) and the input variables, I plotted each input variable against the main feature.

**Figure 7.**

From the plot, it looks like \(fixed.acidity\) and \(quality.f\) have a slight positive correlation. A small number of wines of average quality (5) have extremely high acidity. The mean at every quality level is greater than the median, so the \(fixed.acidity\) distribution is positively skewed.


**Figure 8.**

The lowest-quality wines have a few very high acidity values. From the plot, we can observe a negative correlation. The mean at every quality level is greater than or equal to the median, so the distribution is likely positively skewed.


**Figure 9.**

The variable \(citric.acid\) appears to have a fairly even distribution and a positive correlation with quality. There are 2 very high outliers among wines with quality level 4.


**Figure 10.**

The plot above is produced with the extreme outliers of \(residual.sugar\) removed. The result still shows a lot of outliers in quality categories 5-7.
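The report does not say how the extreme outliers were removed. One plausible approach, shown here purely as an assumption, is to keep only rows at or below a high quantile of the variable. The helper name `trim_extremes`, the 99% cutoff and the toy data are all hypothetical.

```r
# Hypothetical helper: keep rows at or below a chosen quantile of one column.
trim_extremes <- function(df, column, prob = 0.99) {
  cutoff <- quantile(df[[column]], probs = prob, na.rm = TRUE)
  df[df[[column]] <= cutoff, , drop = FALSE]
}

# Toy data: 99 ordinary values and one extreme value.
toy <- data.frame(residual.sugar = c(1:99, 1000) / 10)

trimmed <- trim_extremes(toy, "residual.sugar")
nrow(trimmed)
```

A quantile cutoff removes only the far tail, so moderate outliers within each quality level can still appear in the boxplots, as they do in the figure above.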


**Figure 11.**

This plot is also produced with the extreme outliers of \(chlorides\) removed. The result still shows a lot of outliers in quality categories 5-6. The correlation is negative and very weak.


**Figure 12.**

The view is slightly better with the outliers removed. The mean at every quality level is greater than the median, so the distribution is likely positively skewed.


**Figure 13.**

With outliers removed, the chart reveals a weak negative correlation between \(quality.f\) and \(total.sulfur.dioxide\).


**Figure 14.**

Density has a very small range (0.9901-1.0037), with outliers placed about equally at both ends of the scale. The distribution is approximately normal.


**Figure 15.**

The distribution appears normal, with very few outliers.


**Figure 16.**

With the extreme outliers removed, the plot still shows a lot of outliers among wines of average quality (levels 5-6).


**Figure 17.**

From the plot, the correlation appears to be positive and strong. The distribution of alcohol content between levels 5 and 6 is interesting: the 75th percentile of alcohol at level 5 is lower than the median at level 6.


Correlation Tests

To explore my data, it is best to compute both Spearman and Pearson correlations, since the relation between them might give some information. Spearman coefficient is computed on ranks and so depicts monotonic relationships while Pearson’s is on true values and depicts linear relationships.

  • Test for association between paired samples, using Pearson’s product-moment correlation coefficient.
## # A tibble: 6 × 2
##           cor                                             pair
##         <dbl>                                            <chr>
## 1 -0.68297819             redwine$fixed.acidity and redwine$pH
## 2  0.67170343    redwine$fixed.acidity and redwine$citric.acid
## 3 -0.55249568 redwine$volatile.acidity and redwine$citric.acid
## 4  0.66804729        redwine$fixed.acidity and redwine$density
## 5  0.04207544       redwine$residual.sugar and redwine$alcohol
## 6  0.47616632              redwine$alcohol and redwine$quality

The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. The table above lists Pearson’s correlation coefficient for each pair of variables. The numbers support the previous observations about the relationships between the selected variables.

  • Spearman Rank Correlation test for association strength between the rankings of two variables.
## # A tibble: 6 × 2
##          rho                                             pair
##        <dbl>                                            <chr>
## 1 -0.7066736             redwine$fixed.acidity and redwine$pH
## 2  0.6617084    redwine$fixed.acidity and redwine$citric.acid
## 3 -0.6102595 redwine$volatile.acidity and redwine$citric.acid
## 4  0.6230708        redwine$fixed.acidity and redwine$density
## 5  0.1165481       redwine$residual.sugar and redwine$alcohol
## 6  0.4785317              redwine$alcohol and redwine$quality

The table above lists the Spearman rho coefficient for each pair of variables. Among these pairs, the strongest negative correlation is between \(fixed.acidity\) and \(pH\), and the strongest positive correlation is between \(fixed.acidity\) and \(citric.acid\).
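The difference between the two coefficients can be seen on a small constructed example. The vectors below are illustrative only: `y` increases strictly with `x`, so the ranks agree perfectly, but the relationship is far from linear.

```r
x <- 1:10
y <- exp(x)   # strictly increasing in x, but strongly curved

# Pearson works on the raw values and measures linear association.
pearson <- cor(x, y, method = "pearson")

# Spearman works on the ranks and measures monotonic association.
spearman <- cor(x, y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

When Spearman's rho is noticeably larger in magnitude than Pearson's coefficient, as for \(volatile.acidity\) and \(citric.acid\) above, the relationship may be monotonic but not strictly linear.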

Multivariate Plots & Analysis Section

Multinomial Logistic Regression Model

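I will be using Multinomial Logistic Regression to model a nominal outcome variable, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables. I begin the analysis by including all input variables as predictors; the formula below contains main effects only, with no interaction terms. Note that the call shown uses glm() with the binomial family, which fits a binary logistic regression: for a factor response, R treats the first level as failure and all other levels as success, so the output describes a two-class split of \(quality.f\) rather than a full multinomial model.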

## 
## Call:
## glm(formula = quality.f ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     family = binomial(link = "logit"), data = redwine)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.96787   0.00752   0.02341   0.05730   1.22446  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           494.89510  523.99352   0.944 0.344931    
## fixed.acidity          -0.28791    0.67009  -0.430 0.667446    
## volatile.acidity       -8.40765    2.50988  -3.350 0.000809 ***
## citric.acid            -3.70698    3.92708  -0.944 0.345195    
## residual.sugar          0.14205    0.29387   0.483 0.628827    
## chlorides             -13.03262    7.00680  -1.860 0.062886 .  
## free.sulfur.dioxide    -0.15367    0.08888  -1.729 0.083823 .  
## total.sulfur.dioxide    0.09925    0.04981   1.992 0.046322 *  
## density              -470.45027  533.58716  -0.882 0.377953    
## pH                     -8.01302    4.80305  -1.668 0.095253 .  
## sulphates               2.69403    3.47425   0.775 0.438088    
## alcohol                 1.32310    0.77934   1.698 0.089563 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121.428  on 1598  degrees of freedom
## Residual deviance:  69.165  on 1587  degrees of freedom
## AIC: 93.165
## 
## Number of Fisher Scoring iterations: 10

The regression summary table marks the most influential variables with significance symbols next to the p-values. \(volatile.acidity\) has the lowest p-value, 0.000809, and is marked with three stars (“***”).
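Two figures in the model output can be checked by hand. The sketch below assumes that the binomial fit separated the 10 wines at the lowest observed quality level from the other 1589; that split is an inference from the quality counts, not something stated in the report. Under that assumption the null deviance of an intercept-only model is reproduced, and the AIC follows from the residual deviance plus twice the number of estimated coefficients.

```r
n_total <- 1599   # observations in the dataset
n_first <- 10     # assumed size of the "failure" class (quality 3 wines)

# Intercept-only model: every wine gets the same fitted probability.
p_hat <- (n_total - n_first) / n_total

# Null deviance = -2 * log-likelihood of the intercept-only model.
null_dev <- -2 * (n_first * log(1 - p_hat) +
                  (n_total - n_first) * log(p_hat))
round(null_dev, 3)

# AIC = residual deviance + 2 * number of coefficients (11 inputs + intercept).
aic_full <- 69.165 + 2 * 12
aic_full
```

If the assumed split is right, both values should agree with the output above (null deviance 121.428, AIC 93.165).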


To select a subset of the predictor variables, I performed stepwise variable selection. This is one of the available options to confirm the previous findings.

## 
## Call:
## glm(formula = quality.f ~ volatile.acidity + citric.acid + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + alcohol, family = binomial(link = "logit"), 
##     data = redwine)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2580   0.0076   0.0244   0.0607   1.1836  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           474.57117  267.88706   1.772   0.0765 .  
## volatile.acidity       -9.64610    2.08812  -4.620 3.85e-06 ***
## citric.acid            -5.89262    3.00918  -1.958   0.0502 .  
## free.sulfur.dioxide    -0.14989    0.08004  -1.873   0.0611 .  
## total.sulfur.dioxide    0.10963    0.04756   2.305   0.0212 *  
## density              -458.93682  266.23979  -1.724   0.0847 .  
## pH                     -6.41360    3.57637  -1.793   0.0729 .  
## alcohol                 1.59324    0.67825   2.349   0.0188 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 121.428  on 1598  degrees of freedom
## Residual deviance:  72.299  on 1591  degrees of freedom
## AIC: 88.299
## 
## Number of Fisher Scoring iterations: 10

The selection of variables, p-values and significance codes vary slightly from the results of the full model, but they confirm the general trend. First of all, 4 of the 11 input variables were dropped by the stepwise selection (\(fixed.acidity\), \(residual.sugar\), \(chlorides\) and \(sulphates\)), and of the 7 that remain, only 3 are statistically significant at the 0.05 level.
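The report does not show which function performed the stepwise selection. A minimal sketch, assuming base R's `step()` with AIC as the criterion, looks like this. The data are simulated and the variable names `x1`, `x2`, `x3` and `y` are hypothetical.

```r
set.seed(42)
n <- 400

# Simulated predictors: x1 has a real effect, x2 and x3 are pure noise.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Binary outcome whose log odds depend on x1 only.
sim$y <- rbinom(n, size = 1, prob = plogis(0.5 - 1.5 * sim$x1))

# Full model with all predictors, then AIC-based stepwise selection.
full    <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"), data = sim)
reduced <- step(full, direction = "both", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Because `step()` only accepts a change that lowers the AIC, the reduced model never has a higher AIC than the full one, which matches the drop from 93.165 to 88.299 in the outputs above.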

Of the statistically significant variables \(total.sulfur.dioxide\), \(alcohol\) and \(volatile.acidity\), the last has the lowest p-value, suggesting a strong association with the probability of a wine having higher quality. The negative coefficient for this predictor suggests that, all other variables being equal, a wine with more \(volatile.acidity\) is less likely to have higher quality.
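The size of a logit coefficient is easier to read as an odds ratio. As a rough illustration, the sketch below takes the \(volatile.acidity\) estimate from the stepwise output and converts it to the multiplicative change in the odds for an increase of 0.1 g/dm^3. The step size of 0.1 is an arbitrary choice made for this example.

```r
beta_va <- -9.64610   # volatile.acidity estimate from the stepwise model above
delta   <- 0.1        # illustrative increase in volatile acidity, in g/dm^3

# With a logit link, exp(beta * delta) is the odds ratio for a change of delta.
odds_ratio <- exp(beta_va * delta)
round(odds_ratio, 3)
```

An odds ratio below 1 means the odds of the modeled outcome shrink as volatile acidity rises, holding the other predictors fixed.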

Multivariate Plot and Analysis

From the variable selection table I can see that \(volatile.acidity\) and \(alcohol\) have the lowest p-values, so in this dataset they might have the biggest influence on the final \(quality\) result.

**Figure 18.**

In Figure 18, the plot of \(volatile.acidity\) vs \(alcohol\) reveals quite clearly the clustering by color-coded quality.

Final Plots and Summary

Plot One: Distribution of Red Wine Quality

**Figure 19.**

Summary of the \(quality.f\) variable

##   1   2   3   4   5   6   7   8   9  10 
##   0   0  10  53 681 638 199  18   0   0

Description One

As shown in the plot in Figure 19 and in the summary, the \(quality.f\) variable has most values concentrated in categories 5 and 6. Only a small proportion falls in the remaining categories. There are no values in categories 1, 2, 9 and 10. That means the sample of tested wines did not include any very bad or very good wines. This makes me question the credibility of the data set.
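The counts in the summary above are enough to rebuild the quality vector and check the concentration in categories 5 and 6. The sketch below uses only those counts; the object names are illustrative.

```r
# Counts per quality level, copied from the summary of quality.f above.
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild the quality scores and the factor with the full 1-10 scale.
quality   <- rep(as.integer(names(counts)), times = counts)
quality.f <- factor(quality, levels = 1:10)

table(quality.f)                 # empty levels 1, 2, 9, 10 show up as zeros
mean(quality)                    # mean quality score
mean(quality %in% c(5, 6))       # share of wines rated 5 or 6
```

The rebuilt vector has the same mean as the descriptive statistics table (5.636), and roughly four out of five wines fall in categories 5 and 6.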

Plot Two: Correlation Between Objective Parameters

**Figure 20.**

Description Two

As shown in Figure 20, \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) show the strongest correlation among all wine parameters, with a Spearman rank correlation of 0.789 (this pair is not among the six listed in the Spearman table above).

From the chart, it looks like there might be a threshold of about 100 for higher quality wines. But I’m not sure the chart shows that low quality wines have higher sulfur dioxide. Most of the low quality wine is clustered in the upper or lower portion of the graph, while high quality wine is around the mid-left region.

Plot Three: Distribution of Alcohol vs. Volatile Acidity

**Figure 21.**

Description Three

The \(volatile.acidity\) of a wine is one of the best predictors of its quality. Given the clustering seen in Figure 21, the \(volatile.acidity\) and \(alcohol\) values might be used to predict the \(quality\) of a red wine. The best quality wines have lower levels of volatile acidity and an alcohol level above 10. Regression lines depict the separation between different quality ratings.


Reflection

Wine chemistry explains the flavor, balance and color of wine. Although tastes vary from person to person, some wines are better than others, and most people would probably recognize a good wine from a bad one.

My exploration and analysis started with looking for more information on wine chemistry basics, the fermentation process, and the additives that help improve the quality of wines. My biggest struggles were 1) selecting testing methods and predictive models suited to my data types, since regression analysis includes many modeling techniques, and 2) the actual analysis: interpreting and describing the results of the plots. My conclusion: the tasters’ decisions on wine quality levels are based on their personal tastes. Only very few variables have a strong correlation with the quality of wine. And here is my concern. Wine chemistry is very complex. A commonly accepted notion in the wine industry is that the balance of taste and chemical ingredients is as follows:

Sweet Taste (sugars + alcohols) <=> Acid Taste (acids) + Bitter Taste (phenols)

Can we draw any conclusion about the relationship between quality and the chemical compounds in wine, given that we are presented with measurements of only a small portion of them?

Also, as the quality levels of our dataset show, the sample of tested wines did not include any very bad or very good quality wines. It might mean the sample is not random, which makes me question the analysis and my findings, which might very well be inaccurate.

I take this analysis as good practice for learning the R language and RStudio, and for deepening my knowledge of statistics.

Resources

http://www.calwineries.com/learn/wine-chemistry/

https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf